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Abstract 

Large multilayer neural networks trained with 
backpropagation have recently achieved state-of- 
the-art results in a wide range of problems. How¬ 
ever, using backprop for neural net learning still 
has some disadvantages, e.g., having to tune a 
large number of hyperparameters to the data, 
lack of calibrated probabilistic predictions, and 
a tendency to overfit the training data. In prin¬ 
ciple, the Bayesian approach to learning neural 
networks does not have these problems. How¬ 
ever, existing Bayesian techniques lack scalabil¬ 
ity to large dataset and network sizes. In this 
work we present a novel scalable method for 
learning Bayesian neural networks, called proba¬ 
bilistic backpropagation (PBP). Similar to classi¬ 
cal backpropagation, PBP works by computing 
a forward propagation of probabilities through 
the network and then doing a backward computa¬ 
tion of gradients. A series of experiments on ten 
real-world datasets show that PBP is significantly 
faster than other techniques, while offering com¬ 
petitive predictive abilities. Our experiments also 
show that PBP provides accurate estimates of the 
posterior variance on the network weights. 


1. Introduction 


Neural networks (NNs) have seen a recent resurgence of 
interest due to empirical achievements on a wide range of 
supervised learning problems. In their typical usage, neural 
networks are highly expressive models that can learn com¬ 
plex function approximations from input/output examples 
( |Hornik et al.||l989] ). Part of the success of NNs is due to 
the ability to train them on massive data sets with stochastic 
optimization (Bottou 


2010) and the backpropagation (BP) 


algorithm (Rumelh art et al.| 1986). This, along with faster 
machines, larger datasets, and innovations such as dropout 
( Srivastava et al.| |2014) and rectified linear units ( |Nair &| 
linton 2010 >, have resulted in successes for NNs on tasks 


et al. 


such as speech recognition (Hinton et al. 2012[ Hannun 


et al. 


2015 i and natural language processing (Collobert & 


20141, computer vision (Krizhevsky et al. 2012) Wu 


Weston 2008|[Sutskever et al.||2014] >. 


Despite all these successes, there are still some challenges 
in learning NNs with backpropagation (BP). First, there are 
many hyperparameters in BP-based stochastic optimization 
that require specific tuning, e.g., learning rate, momentum, 
weight decay, etc., each of which may be layer-specific. 
With large data sets, finding the optimal values can take a 
large amount of time, even with an efficient procedure such 
as Bayesian optimization (Snoek et al., 2012| >. Second, in 
NNs trained with BP, we can only obtain point estimates 
of the weights in the network. As a result, these networks 
make predictions that do not account for uncertainty in the 
parameters. However, in many cases these weights may 
be poorly specified and it is desirable to produce uncer¬ 
tainty estimates along with predictions. Finally, it is com¬ 
mon practice to use a very large NN to flexibly fit data, and 
then reign in overfitting using regularization terms, even if 
a smaller network would be cheaper and easier to train. 

A Bayesian approach to neural networks can potentially 
avoid some of the pitfalls of stochastic optimization 
(MacKay 1992c) . Bayesian techniques, in principle, can 
automatically infer hyperparameter values by marginaliz¬ 
ing them out of the posterior distribution or by determining 
them via type II maximum likelihood (empirical Bayes). 
Furthermore, Bayesian methods naturally account for un¬ 
certainty in parameter estimates and can propagate this un¬ 
certainty into predictions. Finally, Bayesian approaches are 
often more robust to overfitting, since they average over pa¬ 
rameter values instead of choosing a single point estimate. 

Several approaches have been proposed for Bayesian learn¬ 
ing of neural networks, based on, e.g., the Laplace approxi¬ 
mation ( |MacKay|| 1992c) , Hamiltonian Monte Carlo ( Neal| 
1995) , expectation propagation (Jylanki et al. 2014 1 , and 
variational inference (Hinton & Camp 1993) . However, 
these approaches have not seen widespread adoption due 
to their lack of scalability in both network architecture 
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and data size. A notable exception is the scalable varia¬ 
tional inference approach of Graves ( 2011[ >. However, this 
method seems to perform poorly in practice due to noise 
from Monte Carlo approximations within the stochastic 
gradient computations. A different scalable solution based 
on expectation propagation was proposed by Soudry et al. 


( 2014} . While this method works for networks with binary 
weights, its extension to continuous weights is unsatisfying 
as it does not produce estimates of posterior variance. 


We describe a new approach for learning Bayesian neural 
networks called probabilistic backpropagation (PBP) that 
is fast and does not have the disadvantages of previous ap¬ 
proaches. PBP works by propagating probabilities forward 
through the network to obtain the marginal likelihood, be¬ 
fore propagating backward the gradients of the marginal 
likelihood with respect to the parameters of the posterior 
approximation. Our experiments show that PBP is fast, 
makes accurate predictions and also produces calibrated es¬ 
timates of the posterior uncertainty in network weights. 


2. Probabilistic neural network models 

We describe a probabilistic model for data based on a feed¬ 
forward neural network. Given data V = {x n ,y n }^ =1 , 
made up of /4-dimensional feature vectors x„ £ R D and 
corresponding scalar target variables y n £ R, we as¬ 
sume that each y n is obtained as y n = /(x n ;W) + e n , 
where /(•; W) is the output of a multi-layer neural network 
with connections between consecutive layers and weights 
given by W. The evaluations of this NN are corrupted by 
additive noise variables e„, where e n ~ A/”(0,7 _1 ). 

The NN has L layers, where V) is the number of hid¬ 
den units in layer l, and W = { W; is the collec¬ 

tion of V[ x (Vi-i + 1) weight matrices between the fully- 
connected layers. The +1 is introduced here to account 
for the additional per-layer biases. We denote the out¬ 
puts of the layers by vectors {zi}fL 0 , where Zq is the 
input layer, {zijfZj 1 are the hidden units and z/ y de¬ 
notes the output layer, which is one-dimensional since 
the y n are scalars. The input to the Z-th layer is 
defined as a/ = W;Z;_i/^/Vj_i + 1, where the fac¬ 
tor l/\/V/_i + 1 keeps the scale of the input to each 
neuron independent of the number of incoming connec¬ 
tions. The activation functions for each hidden layer 
are rectified linear units (ReLUs) (Nair & Hint on} |2010[ >, 
i.e., a(x) = max(r, 0). 

Let y be an iV-dimensional vector with the targets y n 
and X be an N x D matrix of feature vectors x„. The 
likelihood for the network weights W and the noise preci¬ 


sion 7 , with data T> = (X, y) is then 

N 

P(y I W, X, 7 ) = J\f(y n | /(x„; W), 7 _1 ). (1) 

n—1 

To complete our probabilistic model, we specify a Gaus¬ 
sian prior distribution for each entry in each of the weight 
matrices in W. In particular, 

L H Vi-i+l 

p( W iA)=nn n ^k-.zio.a- 1 ), ( 2 ) 

;=ii=l j =1 

where is the entry in the i-th row and j-th col¬ 

umn of W; and A is a precision parameter. The 
hyper-prior for A is chosen to be a gamma distribution, 
i.e., p{ A) = Gam(A | ao> At) with shape «q = 6 and in¬ 
verse scale (3q = 6 . The values chosen for a g and (3q 
make this equivalent to having observed v = 12 samples 
from AF( 0, A -1 ) with empirical variance equal to 1. This 
relatively low value for v compared to the large num¬ 
ber N of observed data points makes this prior weakly- 
informative. The prior for the noise precision 7 is also 
gamma: p( 7 ) = Gam (7 | 0 %, 0$). We assume that the y n 
have been normalized to have unit variance and, as above, 
we fix ctQ = 6 and 0$ = 6 . 

The posterior distribution for the parameters W, 7 and A 
can then be obtained by applying Bayes’ rule: 

ns, M-ro p(y| W,X, 7 M>V| X)p{X)p('y) 
rtW,-,,\\V) = --. (3) 

where p(y | X) is a normalization constant. Given a new 
input vector x*, we can then make predictions for its out¬ 
put j/* using the predictive distribution given by 

p{y* |x*,x>)= j p( 2 /*|x*W, 7 )p(W, 7 , X\T>) dri dX dW, (4) 

where | x*, W, 7 ) = 7V(y* | /(x*), 7 ). However, the 
exact computation of p(W, 7 , A | T>) and p(y* | x+) is not 
tractable in most cases. Therefore, in practice we have to 
resort to approximate inference methods. In the following 
section we describe a technique for approximate Bayesian 
inference in NN models that is both fast and also offers 
excellent predictive performance. 

3. Probabilistic backpropagation 

Backpropagation ( |Rumelharte t al. | 1986) is by far the most 
common method for training neural networks. This method 
operates in two phases to compute a gradient of the loss in 
terms of the network weights. In the first phase, the in¬ 
put features are propagated forward through the network to 
compute the function output and thereby the loss associ¬ 
ated with the current parameters. In the second phase, the 
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derivatives of the training loss with respect to the weights 
are propagated back from the output layer towards the in¬ 
put. These derivatives are used to update the weights using, 
e.g., stochastic gradient descent with momentum. 


In this section we describe a probabilistic alternative to 
the backpropagation algorithm, which we call probabilis¬ 
tic backpropagation (PBP). PBP does not use point esti¬ 
mates for the synaptic weights in the network. Instead, it 
uses a collection of one-dimensional Gaussians, each one 
approximating the marginal posterior distribution of a dif¬ 
ferent weight. PBP also has two phases equivalent to the 
ones of BP. In the first phase, the input data is propagated 
forward through the network. However, since the weights 
are now random, the activations produced in each layer are 
also random and result in (intractable) distributions. PBP 
sequentially approximates each of these distributions with 
a collection of one-dimensional Gaussians that match their 
marginal means and variances. At the end of this phase, 
PBP computes, instead of the prediction error, the loga¬ 
rithm of the marginal probability of the target variable. In 
the second phase, the gradients of this quantity with respect 
to the means and variances of the approximate Gaussian 
posterior are propagated back using reverse-mode differen¬ 
tiation as in classic backpropagation. These derivatives are 
finally used to update the means and variances of the pos¬ 
terior approximation. 


The update rule used by PBP is not the standard step in 
the direction of the gradient of the loss made by the classic 
backpropagation algorithm. PBP uses the following prop¬ 
erty of Gaussian distributions (Minka, 2001). Let f(w) en¬ 
code an arbitrary likelihood function for the single synap¬ 
tic weight w given some data and let our current be¬ 
liefs regarding the scalar w be captured by a distribu¬ 
tion q(w) = Af{w | m, v). After seeing the data, our beliefs 
about w are updated according to Bayes’ rule: 


s(w) = Z 1 f(w) Af(w \m,v), 


(5) 


where Z is the normalization constant. The updated be¬ 
liefs s(w) usually have a complex form and need to be ap¬ 
proximated with a simpler distribution. A common choice 
is to approximate this posterior with a distribution that has 
the same form as q. In this case, the parameters of the new 
Gaussian beliefs g new (u>) = Af(w | m new , u new ) that mini¬ 
mize the the Kullback-Leibler (KL) divergence between s 
and q new can then be obtained as a function of m, v and the 
gradient of log Z with respect to these quantities, namely 


new , dlogZ 

m = m + v - 


dm 


v new = v - v 2 


fdlogZY 
\ dm ) 


- 2 


d log Z 
dv 


( 6 ) 

(7) 


See (Minka 2001), equations 5.12 and 5.13. These rules 
match moments between q new and s, guaranteeing that 


these two distributions have the same mean and variance. 
These are the main update equations used by PBP. The next 
section provides a detailed description of PBP, presenting 


it as an assumed density filtering (ADF) method (Opper 


|& Winther| |1998| ) that uses some of the improvements on 
ADF given by expectation propagation (Minka, 2001). 


3.1. PBP as an assumed density filtering method 

Probabilistic backpropagation is an inference method that 
approximates the exact posterior of a neural network ([3]) 
with a factored distribution given by 


?(W,7, A) = ULiUZiU^Lf 1 


x Gam (7 | ct 7 , /? 7 )Gam(A | a A , fi x ). ( 8 ) 


The approximation parameters Vij i, a y , /3 7 , a A 

and 3 X are determined by applying an assumed density 
filtering method ( |Opper & Winther| |1998[ |Minka| |2001| ) 
on the posterior (M. For this, (Mi is first initialized to 
be uniform, that is, = 0, Vijj = oo, a 1 = a x = 1 
and /? 7 = fi x = 0. After this, PBP iterates over the fac¬ 
tors in the numerator of ([3]) and sequentially incorporates 
each of these factors into the approximation in ([8]). A gen¬ 
eral description of this operation is given in the following 
paragraph. Specific details on how to incorporate each type 
of factor in ([3| are given in the following sections. 

There are two factors for the priors on 7 and A, a total 
of nf = i Vi(Vi~ l+l) factors for the prior on VV given by (|2l) 
and finally, N factors for the likelihood (JT|l. Let f(W. A, 7) 
be one of these factors. PBP incorporates / into the current 
posterior approximation ([8]) by minimizing the KL diver¬ 
gence between s(W, 7, A) = Z~ 1 f(W , A, 7)g(W, 7, A) 
and g(W,7, A), with respect to the parameters of q , 
where Z normalizes s and the q used to construct s is kept 
constant during the minimization of the KL divergence. 
This makes the new q approximate the product of the old q 
and the factor /. The result is an approximation to an exact 
online learning algorithm within the Bayesian framework 
( |Opper & Winther||1998| l. 


3.2. Incorporating the prior factors into q 

The first factors to be incorporated are the priors on 7 
and A. Since these factors have the same functional 
form as <§■ the resulting update for q is straightfor¬ 
ward: a„ew = 0 %, /?ne W = /3j, a^ ew = and /3 x evl = 

After this, we sequentially incorporate the factors in (j2|. 
In this case, the updates for m,, / and v l:l / in ([ 8 ji are given 
by ([ 6 ]) and (jTJ. Similar update rules can be obtained for the 
parameters a x and B x in ([sji. In particular, 

a n A ew = [ZZ 2 Zfi 2 (a x + l)/a x - l.O ]" 1 , (9) 
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fa w =[Z 2 Zi\a x + l)/p x -Z 1 Z~ 1 a x /p x \ 1 , (10) 

where Z is the normalizer of s, that is, the product of the 
factor that is being incorporated and q and Z\ and Z 2 are 
the values of this normalizer when the parameter o A in q 
is increased by one and two units, respectively. The up¬ 
date rules |9]) and ( [TO} do not exactly minimize the KL di¬ 
vergence since that would require matching the sufficient 
statistics for A in q and s, which does not have a closed 
form. Instead, the rules above match the first and second 
moments of A, which also produces good results ( |Minka| 
1 2001 [ [Co we 11 et aLj|1996| l. The derivation of |9]) and ( fTO] ) 
can be found in the supplementary material. One difficulty 
when applying the update rules just described is that the 
normalizer Z of s does not have a closed form. Neverthe¬ 
less, we can approximate Z using 

Z = /A r( Wi j,i | 0, A _1 )g(W, 7, A) dW d-y d\ 

= f Af(wij,i | 0, A _1 )A| 
x Gam(A | a x , /3 x )dwij t i dX 
— y P .1 | 0, /3 J(X , 2a )A( (iJJijj | 771ij ,1 ■, Vij i^jdWij 
~ / A r(wij,i | 0, /3 A /(a A - 1))A r{wij,i \ mij : i,Vij,i)dwij,i 
= A r(mij,i | 0, /3 A /(a A - 1) + v ijt i). (11) 

where T(-1 /x, /?, v) denotes a Student’s t distribution with 
mean /x, variance parameter /3 and degrees of freedom v. In 
the next-to-last line we have approximated the Student’s t 
density with a Gaussian density that has the same mean and 
variance. Finally, Z\, Z> and the gradients of log Z with 
respect to rriijj and u l:j j can be similarly approximated 
by incorporating this approximation of Z into their expres¬ 
sions. By plugging in the resulting quantities in ([6]), ([7]), ([9]) 
and ( fTO} we obtain the new parameter values for q. 

3.3. Incorporating the likelihood factors into q 

After incorporating all the factors in Q, PBP sequentially 
incorporates the N factors for the likelihood <|T]». As before, 
updates for all the m (J ,z and v, :) j in ([8ji are given by ([6]) 
and 0- respectively. The updates for a 1 and fP in 0 are 
given by 0 and ( p~0] >, respectively. To compute all these 
updates we only require Z , the normalization constant 
of s. However, this is difficult to compute, as it requires 
integration of each likelihood factor with respect to the 
distribution of the network output, i.e., ZL = /(x n I W), 
when W ~ q. Let us assume that we have an approximat¬ 
ing Gaussian with mean m ZL and variance v ZL for the dis¬ 
tribution of zl- We can then approximate Z as 

Z = ftf(y n | /(x„ | W), 7 - 1 MW, 7, A) dW d 7 , dX 
~ IP{yn\z L , , y~ 1 W{z L \m ZL ,v ZL )Gam('y \ P,P)z L d'y 
= f T(y n | ZL, P/a\ 2 oF)N{z l I m ZL , v ZL ) dz L 
~ P’iVn. | m, ZL , fPI(oP — 1) + v ZL ) (12) 


where the first approximation in 0 assumes that 
Zl = /(xj | W) ~ N(m ZL , v ZL ) when W ~ q and the 
second approximates the Student’s t density with a Gaus¬ 
sian density that has the same mean and variance. An anal¬ 
ysis of the error of this latter approximation can be found 
in the supplementary material. This expression for Z can 
be substituted into ([6|, (|7j, ([9]) and ( | 1 ()[ > to obtain the new 
parameters for q. 

However, it remains to compute the mean and variance pa¬ 
rameters m ZL and v ZL in ( p~2| ). This is done by propagating 
distributions forward through the network and, when nec¬ 
essary, approximating each new distribution with a Gaus¬ 
sian. For this, let us assume that, when W ~ q, the out¬ 
put of the l — 1 layer z/_-| is a diagonal Gaussian with 
means and variances given by the V)_i-dimensional vec¬ 
tors m z ' 1 and v Zi_1 , respectively. Furthermore, let a; = 
Wiz^/v^ITFT. so that the marginal means and vari¬ 
ances of a i (when VV is distributed as q) are 

m a * = Mzm 2 '- 1 /+ l , (13) 

v a * = [(M|oM|)v ZH + V^m 2 '^ 1 o m 2 '- 1 ) 

+ v i v Zi - i ]/(y i _ 1 + i) (14) 


where M; and V; are V) x (V)_i + 1) matrices whose 
entries are given by rrijji and for i = 1 ,,Vi 

and j = l,..., V;_ i + 1, respectively, and o denotes the 
Hadamard elementwise product. We again assume that the 
entries in a i are independent Gaussian with means and vari¬ 
ances given by the equations above. The Central Limit The¬ 
orem states that a; is approximately Gaussian when Vj_r 


is large (Soudry et al. 2014). Let b; = a(az), where a 


is the rectifier linear activation function a(x) = max(0, x). 
Then, the entries of b/ are a mixture of a point mass at 0 
(when the rectifier is saturated) and a Gaussian truncated at 
0 (when the rectifier is in the linear regime). The mean and 
variance of the i-th entry of b ( are then 

m* 1 = , (15) 

v j bi = + <f>(ai)u, ai (l-7i(7 i+ai)), (16) 


where 


<K-ai) 


and <1) and cf> are respectively the CDF and the density func¬ 
tion of a standard Gaussian. When a, is very large and 
negative the previous definition of 7* is not numerically 
stable. Instead, when o:, < —30, we use the approxima¬ 
tion 7 i = —on — a~ l + 2a~ 3 as recommended by Paquet 


et al. ( 2012j ). The output of the /-th layer, z;, is obtained by 
concatenating 1)/ with the constant 1 for the bias. We can 
therefore approximate the distribution of z; to be Gaussian 
with marginal means and variances 


m z " = [m bi 


v Zi = fv bi 


(17) 
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These concatenated means and variances reflect the lack 
on uncertainty in the “bias unit”. Finally, to compute 
the mean and variance parameters m ZL and v ZL in (12) 
we initialize m z ° to [x, : ; 1 ] and v z ° to 0 and then ap¬ 
ply ( [T3| , < p~4| >, ( p~5| ), ( p~6| > and ( fT7| > iteratively until we 
obtain m ZL = m* L and v Zl = v* L . This resembles the 
forward pass of the standard backpropagation algorithm. 
With m ZL and v ZL , we can evaluate the log of Z as given 
by ( [T2| and the gradients of that quantity that are required 
to apply rules and 0- This is similar to the reverse 
mode differentiation used in backpropagation. We provide 
a Theano-based (Bergstra et al. 2010) implementation of 
PBP at http : / / jmhl. org/ [ as well as a C version us¬ 
ing the gradients given in the supplementary material. 


dom initialization of weights in NNs that is usually done 
before learning with backpropagation. This operation di¬ 
rects our inference method towards one of the multiple 
symmetric modes of the posterior. 


Because the computation of Z in (12 1 is approximate, on 
rare occasions the variance parameters for some weights 
in <[8]» may be negative after incorporating one likelihood 
factor. When this happens, we undo the update for those 
weights and keep their previous mean and variance values. 
A similar operation is often done in EP when negative vari¬ 
ances arise in Gaussian approximate factors (Minka 2001| . 


4. Related Work 


3.4. Expectation propagation 


Expectation propagation (EP) (Minka 2QQ1| > improves on 
assumed density filtering by iteratively incorporating each 
factor multiple times. On each pass over the list of fac¬ 
tors, each factor is removed from the current posterior ap¬ 
proximation, re-estimated, and then reincorporated. Each 
iteration improves the accuracy of the posterior approxi¬ 
mation. The disadvantage of EP over ADF is that it needs 
to keep in memory all of the approximate factors, one for 
each exact factor in the numerator of the posterior. This is 
necessary, because each factor must be able to be removed 
and updated. With massive data sets, the number of likeli¬ 
hoods will be very large and it is not possible to store these 
factors in memory. Instead, we incorporate these factors 
multiple times, but without removing them from the cur¬ 
rent approximation. This is equivalent to doing multiple 
ADF passes through the data, treating each likelihood fac¬ 
tor as a novel example. A disadvantage of this approach is 
that it can lead to underestimation of the variance param¬ 
eters in ([8]i when too many passes are done over the data. 
Nevertheless, PBP is geared toward larger data sets, where 
only a reduced number of passes over the data (fewer than 
100) are possible. Note that we can afford to keep in mem¬ 
ory an approximate factor for each exact factor in the prior 
on the weights ([2ji, since the number and size of these ap¬ 
proximate factors are small. We therefore do one full EP 
update of these approximate factors for the prior after each 
ADF pass through the data. Details for this operation can 
be found in the in the supplementary material. These ap¬ 
proximate factors could also be updated more frequently, 
for example, each time we do an ADF pass through a small 
block of likelihood factors. 


3.5. Implementation details 

After incorporating the factors in 0 for the first 
time, we slightly perturb each mean parameter to^-j 
in 0 from the original value of 0 to be tij.i, 
where e^j ~ Af(0, 1/(VJ + 1)). This is similar to the ran¬ 


The gold standard method for Bayesian learning in neural 
networks is Hamilton Monte Carlo (HMC) ( |Neal[ |1995 |. 
However, this is a batch method that can perform poorly on 
large data sets. HMC also requires problem-specific tun¬ 
ing parameters such as the length and number of integra¬ 
tion steps. One alternative to MCMC inference in neural 
networks is the Laplace approximation ( [MacKay 1992c] l. 
However, the Laplace approximation requires computation 
of the inverse Hessian of the log likelihood, which can be 
infeasible to compute for large networks. Diagonal approx¬ 
imations to the Hessian are possible, but performance can 
deteriorate considerably. One alternative approach based 
on EP is described by Jylanki et al. ( 2014| >. This is a batch 
method that is not expected to scale to large data sets and, 
unlike PBP, it requires numerical quadrature. Jylanki keeps 
in memory several approximate factors for each data point, 
which is not feasible in large scale settings. Furthermore, 
by using latent variables, this method breaks each likeli¬ 
hood factor into additional sub-factors that are incorporated 
into the posterior approximation in multiple disconnected 
steps. PBP incorporates each likelihood factor in a single 
step, which is expected to be more accurate. 


A scalable variational inference (VI) method for neural net¬ 
works is described by Graves (j20TTj). This method maxi¬ 
mizes a lower bound on the marginal likelihood of the NN. 
The computation of this bound requires computing the ex¬ 
pectation of the log of the numerator of the exact poste¬ 
rior ([3J under a factorized Gaussian approximation. This 
is intractable in general, and so Graves (20111 proposes a 
Monte Carlo approximation to compute the lower bound, 
which is then optimized using a second approximation for 
stochastic gradient descent (SGD). While SGD is a com¬ 
mon approach to optimization of neural networks, the ini¬ 
tial approximation leads to inefficient use of data. As a 
result, the VI approach tends to generate poor solutions for 
larger data sets over which only a few passes are possible. 


The technique that is most closely related to PBP is 
the expectation-backpropagation (EBP) method described 
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Dataset 

N 

d 

Avg. Test RMSE and Std. Errors Avg. Test LL and Std. Errors 

VI " BP PBP " VI PBP 

Avg. Running Time in Secs 

VI BP PBP 

Boston Housing 

506 

13 

4.320±0.2914 

3.228±0.1951 

3.014±0.1800 -2.903±0.071 

-2.574=60.089 

1035 

677 

13 

Concrete Compression Strength 

1030 

8 

7.128±0.1230 

5.977±0.2207 

5.667i0.0933 -3.391±0.017 

-3.1613=0.019 

1085 

758 

24 

Energy Efficiency 

768 

8 

2.646±0.0813 

1.098±0.0738 

1.804±0.0481 -2.391±0.029 

-2.042i0.019 

2011 

675 

19 

Kin8nm 

8192 

8 

0.099±0.0009 

0.091i0.0015 

0.098±0.0007 0.897=60.010 

0.896i0.006 

5604 

2001 

156 

Naval Propulsion 

11,934 

16 

0.005±0.0005 

0.001i0.0001 

0.006=60.0000 3.734i0.116 

3.731i0.006 

8373 

2351 

220 

Combined Cycle Power Plant 

9568 

4 

4.327±0.0352 

4.182±0.0402 

4.124=60.0345 -2.890±0.010 

-2.837i0.009 

2955 

2114 

178 

Protein Structure 

45,730 

9 

4.842±0.0305 

4.539i0.0288 

4.732=60.0130 -2.992=60.006 

-2.973i0.003 

7691 

4831 

485 

Wine Quality Red 

1599 

11 

0.646±0.0081 

0.645±0.0098 

0.635=60.0079 -0.980=60.013 

-0.968i0.014 

1195 

917 

50 

Yacht Hydrodynamics 

308 

6 

6.887±0.6749 

1.182±0.1645 

1.0153=0.0542 -3.439=60.163 

-1.634i0.016 

954 

626 

12 

Year Prediction MSD 

515,345 

90 

9.034±NA 

8.932±NA 

8.879=6 NA -3.622=6NA 

-3.603i NA 

142,077 

65,131 

6119 


Table 1. Characteristics of the analyzed data sets, average test performance in RMSE and log likelihood, and average running time. 


by Soudry et al. ( |2014| >, which proposes an online EP 
technique for neural networks with sign activation func¬ 
tions and binary weights, with an extension to continu¬ 
ous weights. As with PBP, EBP also includes a forward 
propagation of probabilities followed by a backward prop¬ 
agation of gradients. However, there are three important 
contributions of PBP with respect to EBP. First, EBP can 
only model data with binary targets and cannot be applied 
when the y n are continuous (as in regression), while PBP 
assumes continuous y n and can be extended to binary tar¬ 
gets using the same method as in EBP. Second, and more 
importantly, EBP with continuous weights only updates the 
mean parameters of the Gaussian posterior approximations. 
In particular, the EBP update operation for each Gaus¬ 
sian approximation includes only equation 0 and does not 
perform the corresponding update for the variance given 
by 0. Therefore, EBP cannot produce accurate uncer¬ 
tainty estimates, as it keeps the posterior variances constant 
during the learning process. Note also that the “learning 
rate” in ([6]) is the variance of the Gaussian approximation. 
In effect, by not updating the variances, EBP makes inef¬ 
ficient updates for the means. Finally, unlike probabilistic 
backpropagation, EBP does not learn the hyperparameter 
for the prior variance A -1 . Instead, EBP keeps A -1 fixed 
to a large initial value. 





Figure 1. Predictions made by each method on the toy data set. 
The noisy observations are shown as black dots, the true data 
generating function is displayed as a black continuous line and 
the mean predictions are shown as a dark gray line. Credible in¬ 
tervals corresponding to ±3 standard deviations from the mean 
are shown as a light gray shaded area. 


5. Experiments 

We evaluate PBP in regression experiments with publicly 
available data sets and neural networks with one hidden 
layer. In PBP we make probabilistic predictions for the 
target variables by using ( [T2| , which approximates 0. 


5.1. Predictive performance 

We first evaluate the predictive accuracy of PBP. Table |T] 
lists the analyzed data sets and shows summary statistics. 
We use neural networks with 50 hidden units in all cases 
except in the two largest ones, i.e.. Year Prediction MSD 
and Protein Structure, where we use 100 hidden units. We 
compare PBP with the variational inference (VI) approach 
described in Section [4] and with standard stochastic gradi¬ 
ent descent via backpropagation (BP). These methods were 
coded using Theano (Bergstra et al. 2010|>. 


The different methods, PBP, VI and BP, were run by per¬ 
forming 40 passes over the available training data, updating 
the model parameters after seeing each data point. The data 
sets are split into random training and test sets with 90% 
and 10% of the data, respectively. This splitting process 
is repeated 20 times and the average test performance of 
each method is reported. In the two largest data sets. Year 
Prediction MSD and Protein Structure, we do the train-test 
splitting only one and five times respectively. The data sets 
are normalized so that the input features and the targets 
have zero mean and unit variance in the training set. The 
normalization on the targets is removed for prediction. 

BP and VI have several hyperparameters that have to be 
optimally adjusted to the data. These are learning rate and 
momentum in BP and VI and weight decay in BP. We select 
these hyperparameter values by maximizing the predictive 
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Figure 2. Average test RMSE and standard errors in the active learning experiments with Boston Housing, Yacht and Energy data sets. 


performance of each method on a validation set with 20% 
of the training data. For this task we use Bayesian opti¬ 
mization (BO) techniques (Snoek et al J 2012) >. In partic¬ 
ular, for each train-test split of the data, we use BO to se¬ 
quentially evaluate the validation performance of 30 differ¬ 
ent hyperparameter configurations. After that, an optimal 
value for the hyperparameters is selected and used to fit the 
model on the training set. 


Table [T] shows the average test root mean squared error 
(RMSE) for each method. On each data set, the results 
of the best method are shown in bold. Overall, PBP and 
BP perform best, with PBP obtaining the best results in 6 
out of 10 data sets. Unlike BP, PBP automatically adjusts 
its hyperparameters and does not require an expensive BO 
search. VI performs rather poorly, evidently due to the use 
of two stochastic approximations. First, VI approximates 
the lower bound on the model evidence by sampling from 
the variational approximation and second, VI further ap¬ 
proximates that bound by subsampling the data. BP and 
PBP only perform the second type of approximation. 


Table |T] also shows the average test log-likelihood for VI 
and PBP, and average running time for each method, in sec¬ 
onds. PBP is considerably better than VI, which performs 
rather poorly. BP and VI are very slow since they have to be 
re-run 30 times to search for their optimal hyperparameter 
values. The BO search in these methods also has a consid¬ 
erable overhead in the smallest data sets. PBP is the fastest 
method since it does not have to select any hyperparameter 
values and is run only once. 


5.2. Multiple hidden layers 

A comparison of the test RMSE obtained by PBP and BP in 
neural networks with up to 4 hidden layers can be found in 
the supplementary material. The experimental protocol in 
these experiments is the same as before. We use networks 
with 50 units in each hidden layer, except in the datasets 


Year and Protein, where we use 100. These results are sim¬ 
ilar to those shown in Table [T] with PBP obtaining usually 
the best results with 2 hidden layers. 

5.3. Toy data set 

We further evaluate the predictive distribution obtained by 
PBP in a toy data set generated by sampling 20 inputs x 
uniformly at random in the interval [— 4 , 4 ]. For each 
value of x obtained, the corresponding target y is com¬ 
puted as y = x 3 + e n , where e n ~ 7 V( 0 , 9 ). We fitted a 
neural network with one layer and 100 hidden units to these 
data using PBP. We compare PBP with VI and BP, using 
40 training epochs in all these methods. We also compare 
with a ground truth generated by Hamiltonian Monte Carlo 
(HMC). HMC is implemented by modifying the MCMC- 
stuff Matlab toolbox ( [Vanhatalo & Vehtari| |2006| > to in¬ 
clude rectified linear activation functions. We run HMC 
by drawing 30,000 samples from the posterior distribution. 

Figure [I] shows the predictions generated by each method. 
PBP and BP are much closer to the ground truth HMC 
than VI. Furthermore, BP and PBP perform similarly, 
even though PBP automatically adjusts its hyperparameters 
while BP has to use BO methods for this task. 

5.4. Active learning 

We performed another series of experiments to evaluate the 
accuracy of the estimates of the posterior variance on the 
weights produced by PBP. For this, we use an active learn¬ 
ing scenario ( [Settles} |2009j ) since in this type of problems 
it is necessary to produce accurate estimates of uncertainty 
for obtaining good performance. 

In these experiments, we used a neural network with a sin¬ 
gle hidden layer and ten hidden units. We split each data set 
into training and test sets with 20 and 100 data instances, 
respectively, and pool sets with all the remaining data. PBP 
is fitted using the training data and then, its performance is 
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Dataset 

LA-R 

LA-A 

EP-R 

EP-A 

PBP-R 

PBP-A 

HMC-R 

HMC-A 

Boston 

9.600±0.154 

9.452±0.111 

8.632T0.231 

8.426±0.264 

6.716±0.500 

5.480±0.175 

5.750T0.222 

5.156±0.150 

Concrete 

16.889±0.182 

16.938±0.173 

16.767±0.174 

16.897±0.151 

12.417±0.392 

11.894±0.254 

10.564±0.198 

11.484±0.191 

Energy 

10.110±0.075 

10.135±0.()70 

3.616±0.101 

3.634A0.159 

3.743±0.121 

3.399±0.064 

3.246T0.067 

3.118±0.062 

Kin8nm 

0.271±0.003 

0.270±0.002 

0.272±0.002 

0.271±0.002 

0.259±0.006 

0.254±0.005 

0.226±0.004 

0.223±0.003 

Naval 

0.015±0.000 

0.015±0.000 

0.015±0.000 

0.015±0.000 

0.015±0.000 

0.016±0.000 

0.013±0.000 

0.012±0.000 

Power Plant 

17.195±0.120 

17.306±0.149 

8.234±0.831 

6.251T0.599 

5.312±0.108 

5.068±0.082 

5.229T0.097 

4.800±0.074 

Protein 

6.165±0.073 

6.227±0.088 

6.118±0.074 

6.151±0.077 

6.133±0.141 

5.903±0.127 

5.613±0.089 

5.727±0.090 

Wine 

0.843±0.011 

0.829±0.010 

0.836±0.010 

0.832T0.009 

0.945±0.044 

0.809±0.011 

0.740±0.011 

0.749±0.010 

Yacht 

15.926±0.409 

15.463±0.310 

15.173±0.214 

15.442±0.390 

5.388±0.339 

4.058±0.158 

4.644±0.237 

3.211±0.120 


Table 2. Average test RMSE and standard errors in active learning. 


evaluated on the test data. After this, one data point is col¬ 
lected from the pool set and then moved into the training 
set. The process repeats until 9 of these active additions to 
the training set have been completed, that is, until we have 
performed 10 evaluations on the test set. The entire pro¬ 
cess, including the random data set splitting, is repeated 40 
times. The pool data is initially lacking the target variables 
and these become available only once the data is moved to 
the training set. As before, we run PBP for 40 epochs. 

We compare PBP with a ground truth obtained by a HMC 
method in which we draw 500 samples from the posterior. 
We also compare with the batch EP algorithm for neural 
networks described by Jyla nki et al.| ( |2014| ). This method 
uses nonlinear activation functions given by the standard 
Gaussian CDF. We further compare with the Laplace ap¬ 
proximation (LA) of |MacKay| ( | 1992c} using the neural net¬ 
work toolbox from Matlab with tanh nonlinearities. In LA 
we approximate the Hessian of the unnormalized posterior 
distribution with the Levenberg-Marquardt approximation 
and assume a diagonal Hessian matrix. This allows LA to 
scale to large data sets and larger networks. We compare 
two versions of PBP, HMC, EP and LA. One in which the 
data from the pool set is collected actively (PBP-A, HMC- 
A, EP-A and LA-A) and another one in which the pool data 
is collected uniformly at random (PBP-R, HMC-R, EP-R 
and LA-R). We re-trained from scratch all the methods af¬ 
ter each new addition to the training set from the pool set. 


To actively collect data from the pool set we follow 


the information-based approach described by MacKay 


( 1992a| . The goal is to maximize the expected reduction 
in posterior entropy that is produced by adding data to the 
training set. This implies choosing the x that maximizes 


H[W, 7 , A | 2>] - E„ | XiC H[W, 7, A | V U {x, y }}, (18) 

where H[-] is the differential entropy. Following 
et al.] ( j 2012 ] >, we can rewrite ( | 1 8| by swapping 
of y and the model parameters W, 7 , A. We finally obtain 


Houlsby 
the roles 


H [y I X, V] - E w , 7jA 1 v H[y I w, 7, A, x]. (19) 

Since the last term in © is constant, we select the x 
that maximizes the entropy of the predictive distribu¬ 
tion p(y\x,T>). Therefore, all the methods select the 
next x with highest predictive variance. 


Table [2] shows the average test RMSE for each method at 
the end of the data collection process. These results show 
that the active learning approach HMC-A is significantly 
better than the random approach HMC-R in the data sets 
Boston, Energy, Power Plant and Yacht. In these data sets 
we also see a significant improvement of PBP-A over PBP- 
R. This indicates that PBP produces useful estimates of 
posterior variance. In these experiments PBP is usually 
better than EP and LA. LA performs poorly because it can¬ 
not correctly select the hyperparameters A and 7 , due to 
the diagonal Hessian approximation, as also observed by 
MacKayl ( jl992bj l. PBP does not have this problem. 

Finally, Figure [2] shows the evolution of the average test 
RMSE for each method during the data collection process 
in the problems Boston, Yacht and Energy. These plots 
indicate that the improvements of PBP-A over PBP-R are 
similar to those of HMC-A over HMC-R. Furthermore, we 
can see that the active learning strategy does not work as 
well in EP and LA as it does in PBP and HMC. 

6. Conclusions and future work 

We have presented probabilistic backpropagation (PBP), a 
new algorithm for scalable Bayesian learning of neural net¬ 
works. PBP uses a product of Gaussians to approximate 
the posterior over weights. The parameters of these Gaus¬ 
sians are updated in a two stage process similar to the one 
used by the backpropagation algorithm. First, probabilities 
are propagated forward through the network to obtain the 
marginal likelihood and second, the gradients of this quan¬ 
tity with respect to the Gaussian parameters are propagated 
backwards. These gradients are finally used to update the 
parameters of the approximation to the posterior distribu¬ 
tion. Experiments on ten datasets show that PBP makes 
accurate predictions. Furthermore, we also show that PBP 
produces useful estimates of the posterior variance on the 
network weights. In summary, PBP is a fast method with 
state-of-the-art performance for Bayesian learning of neu¬ 
ral networks. As future work we plan to address multi¬ 
label and multi-class problems. We will also make PBP use 
mini-batches and output estimates of the model evidence 
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1 Derivation of the gradients 


In this section we derive the gradient of the logarithm of the marginal likelihood, that is, log Z, with respect to the 
means and variances of the network weights in the Gaussian approximation q. In traditional backpropagation we 
have, for each neuron j, one variable 6 1 containing the gradient of the network error with respect to the input or 
activation for neuron j. In PBP, the corresponding algorithm is very similar, with the difference that we now have 
two variables for each neuron j instead of only one. We have one variable ()'" that contains the gradient of log Z 
with respect to the mean of the activation for neuron j. Additionally, there is another variable S'- that contains the 
gradient of log Z with respect to the variance of the activation for neuron j. 

The mean and variance of the output of unit j are defined as mj and i)J, respectively. The mean and variance 
of the activation or input for unit j are defined as m“ and v°, respectively. We have that, becauseof the ReLU 
activation function, 


m) = $(aj) 


m. 


v nj 




( 1 ) 

( 2 ) 


where 7 j = cij = ra “/-^/vj and <fi and $ denote the standard Gaussian pdf and cdf. For the single 

neuron in the last layer we have that rn| = m“ and v | = u“. 

We have that m“ and v° are given by 
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where I(j) is the set of neurons whose output is the input to neuron j, mf :j and vlf,- are the mean and variances of 
the weight connecting neurons i and j. Therefore, 


dnij 1 w dnij 


dm l \/\I(j)\ 3,1 ’ 

dvf 

= 0 , 

dvj 2 mfvjj 

dvj 

+ v Z,i 

dm\ \I(j)\ 

dvt 

\m 

dmf mj 

dm°; 

= 0 , 

dm Zi VKW ’ 

dv i,j 

dvf 2v j m Zj 

dv z 1 

[mf\ 2 + Vj 

dm%j |/(*)| 

dv T,j 

\m\ 5 


(5) 

(6) 

(7) 

( 8 ) 


1 



We now compute the gradient of 7 j and aij with respect to m“ and v°: 
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We now define the variables d"' and Sj to be 
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where the sum is over each neuron k to which neuron j sends signals. The above rules can be recursively written 
as follows: 
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We can then write the required terms „ 
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Table 1: Average Test RMSE in the experiments with deep neural networks. 


Dataset 

BPi 

bp 2 

bp 3 

bp 4 

PBPi 

PBPo 

PBPs 

pbp 4 

Boston 

3.228±0.1951 

3.185±0.2365 

3.0 1 9=hO. 1 848 

2.874±0.1570 

3.014±0.1800 

2.795±0.1590 

2.938±0.1645 

3.088±0.1519 

Concrete 

5.977±0.2207 

5.396±0.1273 

5.568±0.1271 

5.530±0.1390 

5.667±0.0933 

5.241±0.1164 

5.732±0.1075 

5.956±0.1597 

Energy 

1.185±0.1242 

0.676±0.0367 

0.628±0.0278 

0.667±0.0321 

1.804±0.0481 

0.903±0.0482 

1.237±0.0592 

1.176±0.0552 

Kin8nm 

0.091±0.0015 

0.073±0.0009 

0.071 ±0.0006 

0.071 ±0.0009 

0.098±0.0007 

0.071±0.0005 

0.073±0.0007 

0.075±0.0008 

Naval 

0.001 ±0.0001 

0.001±0.0000 

0.001 ±0.0001 

0.001±0.0001 

0.006±0.0000 

0.003±0.0001 

0.010±0.0013 

0.004±0.0011 

Power Plant 

4.182±0.0402 

4.220±0.0744 

4.112±0.0378 

4.184±0.0591 

4.124±0.0345 

4.028±0.0347 

4.065±0.0382 

4.075±0.0366 

Protein 

4.539±0.0288 

4.188±0.0313 

4.014±0.0326 

3.960±0.0110 

4.688±0.0115 

4.251±0.0153 

4.094±0.0285 

3.970±0.0376 

Wine 

0.645±0.0098 

0.651±0.0108 

0.652±0.0101 

0.650±0.0158 

0.635±0.0079 

0.643±0.0077 

0.641±0.0086 

0.637±0.0079 

Yacht 

1.182±0.1645 

1.542±0.1920 

1.107±0.0863 

1.265±0.1287 

1.015±0.0542 

0.848±0.0495 

0.893±0.0991 

1.711±0.2288 

Year 

8.932±NA 

8.976±NA 

8.933±NA 

9.045±NA 

8.869± NA 

8.918±NA 

8.874±NA 

8.934±NA 


Finally, we have that 


d log Z 

_ r dm f 

dv a 

- + 5? 1 , 

(21) 

dmfj ' 

3 dm™j 

3 dm™j 

d log Z 

<Kj 

_ c m dm i 

3 dvfj 

cv dv? 
j dvV’j ' 

(22) 


2 Results with neural networks including more than one hidden layer 

We repeated the experiments from Section 5.1 in the main document for the methods BP and PBP, using neural 
networks with 2, 3 and 4 hidden layers. We used networks with 50 units in each hidden layer, except in the datasets 
Year and Protein , where we used 100. Table 1 shows the average test RMSE and the corresponding standard errors 
obtained by PBP;,, and BP, : , where x is the number of hidden layers in the network. PBP has the best overall 
predictive performance, with PBP 2 achieving the best results in 5 datasest. Note that the optimal number of hidden 
layers in PBP is problem dependent. In datasets such as Wine and Year one single hidden layer is optimal, while 
in Protein we find that 4 hidden layers is better. 


3 Error in the second approximation in equation (12) in the main text 

In this section we evaluate the error in the second approximation performed in equation (12) in the main document. 
This approximation consists in replacing the Student’s t density with a Gaussian density that has the same mean 
and variance. This approximation becomes more and more accurate as the degrees of freedom in the Student’s t 
density increase. This will often be the case as we iterate over the data and we reduce our uncertainty on the value 
of the noise parameter 7. We evaluated the relative error in log Z caused by this approximation as PBP iterates over 
the data of the Boston Housing dataset in the experiments of Section 5.1 in the main document. The left plot in 
Figure 1 shows the error during the first 100 iterations of PBP over the individual datapoints. The right plot shows 
the error during the last 100 iterations of the method. We can see that the error is very small in the second case. In 
particular, at this stage we are highly confident on the value of the noise parameter 7 and the parameters op and 
/3 7 in the posterior approximation take relatively high values. This increases the number of degrees of freedom of 
the Student’s t density in equation (12), what improves the quality of the Gaussian approximation. 


4 List of approximations 

In this section we list all the approximations performed by the method PBP. The list of approximations is 

• We use expectation propagation (EP) to adjust a parametric approximation, given by equation (8) in the main 
document, to the exact posterior distribution, given by equation (3) in the main document. 

• In our implementation of EP, we refine the parameters a 7 , /? 7 , a x and B x of the posterior approximation by 
matching the first and second moments of A and 7. The KL divergence would be minimized by matching the 
expectation of the sufficient statistics of a Gamma distribution, but this does not have an analytical solution. 
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Figure 1: Relative error in the approximation of log Z as PBP iterates over the data of the Boston Housing dataset, 
in the experiments of Section 5.1 of the main paper. Left, relative error during the first 100 iterations of the 
method over the individual datapoints. Right, relative error during the last 100 iterations of the method. We can 
see that the error is very small during the last iterations. At this stage we are highly confident on the value of the 
noise parameter 7, the parameters a 7 and /? 7 in the posterior approximation take relatively high values and the 
degrees of freedom of the Student’s t density in equation (12) are high, what increases the quality of the Gaussian 
approximation. 


• We approximate the normalization constants in equations (11) and (12) of the main document by replacing 
a Student’s t density with a Gaussian density that has the same mean and variance. 

• EP requires to keep in memory one approximate factor for each exact factor in the numerator of the posterior 
distribution. With massive data the number of exact likelihood factors is very large and keeping in memory 
all the corresponding approximate factors is inpractical. To avoid this, we do not keep these approximate 
factors in memory and we do not remove them from the current approximation before processing each 
datapoint. This is equivalent to doing multiple ADF passes through the data, treating each likelihood factor 
as a novel example. A disadvantage of this approach is that it can lead to underestimation of the posterior 
variance when too many iterations are done over the data. 


5 Derivations of equations (9) and (10) in the main text 


Let the Gamma density be defined as Gamma(a:|a,/3) = f3 a x a 1 exp{— xf3'\T{a) 1 . We denote the normalization 
constant of /(a;)Gamma(a:|a, /3) by H{a , /?). In particular, 


H(a,f3) = J /(a;)Gamma(a:|a, /3) dx. 


(23) 


Note that we explicitly write H as function of a and /3. Then we have that the first and second moments of the 
normalized version of /(a;)Gamma(a;|a, (3) are given by 


1 


H(a,p) 

1 


H(a,/3) 


x/(a;)Gamma(a;|a,/3) dx = 

r 

a; 2 Gamma(a;|a, /3) dx = 


H{a + 1, P)a 
H{a,p)p 

H(a + 2, P)a(a + 1) 

H(o^W 2 


(24) 

(25) 
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Thus, each moment can be easily approximated given a procedure to approximate the normalization constant 
H(a , ft). For this, we only have to substitute H(a, ft), H(a + 1, ft) and H(a + 2, ft) in the previous expressions 
with their corresponding approximations. Note that the mean and variance of Gamma(x|a, ft) are given by a/ft 
and a/ft 2 , respectively. We can then find the new parameters a new and ft new of a Gamma distribution that has the 
same mean and variance as the normalized version of f{x)Gamma(x\a, ft) by solving the system of equations 
given by 


ct new H(a + l,ft)a a new _ H{a + 2, ft) a {a + 1) f H {a + 1, ft) a \ 2 

0”™ _ H{a, ft)ft ’ [/3 new ] 2 - H (a, ft)ft 2 \ H{a,ft)ft J ’ '" f ” 

Let Z = H(a, ft), Z x = H{a + 1, ft) and Z 2 = H{a + 2, ft). Then 

a new = [ZZ 2 Zft 2 {a+l)/a. - 1.0 ] _1 , (27) 

ft aew = [Z 2 Zft\a+l)/ft- ZxZ-'a/ftY 1 . (28) 

6 EP updates for the approximate factors corresponding to the prior 

The only prior factors that need to be processed multiple times using expectation propagation are the factors in 
equation (2) in the main document. The other Gamma priors on A and 7 have the same functional form as the 
posterior approximation q. This means that they need to be incorporated only once into q since any removal and 
posterior re-incorporation of these factors would not produce any improvement in q. 

We re-write here the expression for the prior factors than need to be processed multiple times, that is, equation 
( 2 ) from the main document: 


L Vi Vj-i+l 

p(W|A) = ]J]J A^ 1 ). 

1=1 i=1 j=1 

We also re-write here the expression for the posterior approximation q: 


(29) 


g(w, 7 , A) = 

We denote each exact factor in (29) by 


L Vt V1-1 + I 

nn n 

1 = 1 2=1 j = 1 


Gamma( 7 |a 7 , y S 7 )Gamma(A|a A , /3 A ). (30) 


=Af(wij t i\0,X~ 1 ) . (31) 

Each of these exact factors is approximated by a corresponding approximate factor given by 

A) — (//.Ujjjftifijjj , v '/'/.j jGammaf A|o: V y /, ft'ij.i ), (32) 

Initialliy all the f t jj are uniform, that is, rhij t i = 0, Vijj = oo, 07,7 = 1 and fftjj, = 0. EP starts to incorporate all 
the fij i into q once it has already incorporated the Gamma priors for A and 7. The first time f l7 j is incorporated 
into q we update fijj and q as follows: 

rhij t t = 0, v ijt i = fto/ (ao - 1), (33) 

m ij,i = 0, v ijt i = /?o /(«o - 1) , (34) 

where ctg and ftj) are the parameters of the Gamma prior on A. These rules guarantee the matching of means 
and variances on Wijj after approximating the Student’s t density in equation (11) in the main document with a 
Gaussian that has the same mean and variance. 
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On successive iterations, we refine f, :) i by first removing this approximate factor from q to obtain a cavity 
distribution. This cavity is computed as the ratio of q and f, h i. The cavity marginal distribuion on w lh i and A is 
therefore 


q\ l i' l (wijj, A) = tA u,( )Gamma(A|a 


\ij,l 


\ij,l 

> 


Pi™) 


(35) 


where 




v ij\i v i 0 \i 


-1 


\ij,l X 

&X = a ~ a ij,l + 1 ■ 






p \ ij ’ 1 =( 3 X ~ 4 m . 


(36) 

(37) 


After this, we update the parameters of q to match moments between q(wijj. A) and the normalized version of 
f(wijj, X)q^ ,l (wij i, A). For this, we use expression (11) in the main text to approximate the normalization 
constant of A). This last step is obtained by replacing q in equation (11) in the main text 

with the cavity distribution. Equations (6), (7), (9) and (10) from the main document are then used to obtain the 
new parameters m t jj, v l;h i, a x and S x for the posterior aproximation. Finally, we update the parameters for the 
approximate factor f t] j using 






v ij,l ~ 


X \ij,l 
a ij,l = OL - 


m ij,l — ^ ij,l 

fiij.l = /3 A - 


~ m\ ij ’ l [v\ ij ’ 1 ] 1 

■ 


(38) 

(39) 
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